AllLife Bank has a growing customer base. Majority of these customers are liability customers (depositors) with varying size of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.
You as a Data scientist at AllLife bank has to build a model that will help marketing department to identify the potential customers who have higher probability of purchasing the loan. This will increase the success ratio while at the same time reduce the cost of the campaign.
Objective
To predict whether a liability customer will buy a personal loan or not. Which variables are most significant. Which segment of customers should be targeted more.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
warnings.filterwarnings('ignore')
customers = pd.read_csv('Loan_Modelling.csv')
# copying data to another varaible to avoid any changes to original data
data=customers.copy()
data.head(10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 6 | 37 | 13 | 29 | 92121 | 4 | 0.4 | 2 | 155 | 0 | 0 | 0 | 1 | 0 |
| 6 | 7 | 53 | 27 | 72 | 91711 | 2 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 8 | 50 | 24 | 22 | 93943 | 1 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 9 | 35 | 10 | 81 | 90089 | 3 | 0.6 | 2 | 104 | 0 | 0 | 0 | 1 | 0 |
| 9 | 10 | 34 | 9 | 180 | 93023 | 1 | 8.9 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
data.tail(10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4990 | 4991 | 55 | 25 | 58 | 95023 | 4 | 2.00 | 3 | 219 | 0 | 0 | 0 | 0 | 1 |
| 4991 | 4992 | 51 | 25 | 92 | 91330 | 1 | 1.90 | 2 | 100 | 0 | 0 | 0 | 0 | 1 |
| 4992 | 4993 | 30 | 5 | 13 | 90037 | 4 | 0.50 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4993 | 4994 | 45 | 21 | 218 | 91801 | 2 | 6.67 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4994 | 4995 | 64 | 40 | 75 | 94588 | 3 | 2.00 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.90 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.40 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.30 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.50 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.80 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
data.shape
(5000, 14)
data[data.duplicated()].count()
ID 0 Age 0 Experience 0 Income 0 ZIPCode 0 Family 0 CCAvg 0 Education 0 Mortgage 0 Personal_Loan 0 Securities_Account 0 CD_Account 0 Online 0 CreditCard 0 dtype: int64
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
data.isnull().sum()
ID 0 Age 0 Experience 0 Income 0 ZIPCode 0 Family 0 CCAvg 0 Education 0 Mortgage 0 Personal_Loan 0 Securities_Account 0 CD_Account 0 Online 0 CreditCard 0 dtype: int64
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
#checking unique categories and their numbers for each feature
print(data.Age.value_counts())
print(data.ZIPCode.value_counts())
print(data.Family.value_counts())
print(data.Education.value_counts())
print(data.Personal_Loan.value_counts())
print(data.Securities_Account.value_counts())
print(data.CD_Account.value_counts())
print(data.Online.value_counts())
print(data.CreditCard.value_counts())
35 151
43 149
52 145
58 143
54 143
50 138
41 136
30 136
56 135
34 134
39 133
59 132
57 132
51 129
60 127
45 127
46 127
42 126
40 125
31 125
55 125
62 123
29 123
61 122
44 121
32 120
33 120
48 118
38 115
49 115
47 113
53 112
63 108
36 107
37 106
28 103
27 91
65 80
64 78
26 78
25 53
24 28
66 24
23 12
67 12
Name: Age, dtype: int64
94720 169
94305 127
95616 116
90095 71
93106 57
...
94970 1
92694 1
94404 1
94598 1
94965 1
Name: ZIPCode, Length: 467, dtype: int64
1 1472
2 1296
4 1222
3 1010
Name: Family, dtype: int64
1 2096
3 1501
2 1403
Name: Education, dtype: int64
0 4520
1 480
Name: Personal_Loan, dtype: int64
0 4478
1 522
Name: Securities_Account, dtype: int64
0 4698
1 302
Name: CD_Account, dtype: int64
1 2984
0 2016
Name: Online, dtype: int64
0 3530
1 1470
Name: CreditCard, dtype: int64
Most features are boolean type.
# lets plot histogram of all plots
from scipy.stats import norm
all_col = data.select_dtypes(include=np.number).columns.tolist()
all_col.remove('ID')
plt.figure(figsize=(17,75))
for i in range(len(all_col)):
plt.subplot(18,3,i+1)
plt.hist(data[all_col[i]])
#sns.displot(df[all_col[i]], kde=True)
plt.tight_layout()
plt.title(all_col[i],fontsize=25)
plt.show()
Age and Experience looks uniformaly disributed. Income and CCAvg looks rightly skewed. Most other features are boolean.
#looking at counts of each boolean variable
print(data.Personal_Loan.value_counts())
print(data.Securities_Account.value_counts())
print(data.CD_Account.value_counts())
print(data.Online.value_counts())
print(data.CreditCard.value_counts())
0 4520 1 480 Name: Personal_Loan, dtype: int64 0 4478 1 522 Name: Securities_Account, dtype: int64 0 4698 1 302 Name: CD_Account, dtype: int64 1 2984 0 2016 Name: Online, dtype: int64 0 3530 1 1470 Name: CreditCard, dtype: int64
# outlier detection using boxplot
plt.figure(figsize=(20,30))
numeric_columns = ['Age', 'Experience', 'Income', 'ZIPCode', 'CCAvg', 'Mortgage']
for i, variable in enumerate(numeric_columns):
plt.subplot(5,4,i+1)
plt.boxplot(data[variable],whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
print(numeric_columns)
['Age', 'Experience', 'Income', 'ZIPCode', 'CCAvg', 'Mortgage']
CCavg and Martgage have a lot of outliers. We will leave them in the dataset
# Function to create barplots that indicate percentage for each category.
def perc_on_bar(plot, feature):
'''
plot
feature: categorical feature
the function won't work if a column is passed in hue parameter
'''
total = len(feature) # length of the column
for p in ax.patches:
percentage = '{:.1f}%'.format(100 * p.get_height()/total) # percentage of each class of the category
x = p.get_x() + p.get_width() / 2 - 0.05 # width of the plot
y = p.get_y() + p.get_height() # hieght of the plot
ax.annotate(percentage, (x, y), size = 12) # annotate the percantage
plt.show() # show the plot
plt.figure(figsize=(10,10))
select_columns = ['Age','Experience','Income','Family','Mortgage']
for i, variable in enumerate(select_columns):
ax = sns.countplot(data[select_columns[i]])
perc_on_bar(ax,data[select_columns[i]])
plt.figure(figsize=(10,10))
plt.show()
<Figure size 720x720 with 0 Axes>
Most of the featues have unifrom distribution. Mortgage has a right skewed distribution
plt.figure(figsize=(15,7))
sns.heatmap(data.corr(),annot=True)
plt.show()
#Lets see these correlations on pairplot
sns.pairplot(data=data,hue="Personal_Loan")
plt.show()
dummy_data = pd.get_dummies(data, columns=['Family','Education','Securities_Account','CD_Account','Online','CreditCard'],drop_first=True)
dummy_data.head()
| ID | Age | Experience | Income | ZIPCode | CCAvg | Mortgage | Personal_Loan | Family_2 | Family_3 | Family_4 | Education_2 | Education_3 | Securities_Account_1 | CD_Account_1 | Online_1 | CreditCard_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 1.6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 1.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 2.7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 1.0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
column_names = list(dummy_data.columns)
column_names.remove('Personal_Loan') # Keep only names of features by removing the name of target variable
column_names.remove('ID')
column_names.remove('Experience')
feature_names = column_names
print(feature_names)
dummy_data.drop('ID',axis=1)
dummy_data.drop('Experience',axis=1) # highly corelated with age
['Age', 'Income', 'ZIPCode', 'CCAvg', 'Mortgage', 'Family_2', 'Family_3', 'Family_4', 'Education_2', 'Education_3', 'Securities_Account_1', 'CD_Account_1', 'Online_1', 'CreditCard_1']
| ID | Age | Income | ZIPCode | CCAvg | Mortgage | Personal_Loan | Family_2 | Family_3 | Family_4 | Education_2 | Education_3 | Securities_Account_1 | CD_Account_1 | Online_1 | CreditCard_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 49 | 91107 | 1.6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 34 | 90089 | 1.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 11 | 94720 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 100 | 94112 | 2.7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 45 | 91330 | 1.0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 29 | 40 | 92697 | 1.9 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 15 | 92037 | 0.4 | 85 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 24 | 93023 | 0.3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 49 | 90034 | 0.5 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 83 | 92612 | 0.8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
5000 rows × 16 columns
X = dummy_data.drop('Personal_Loan',axis=1) # Features
y = dummy_data['Personal_Loan'].astype('int64') # Labels (Target Variable)
# converting target to integers - since some functions might not work with bool type
# Splitting data into training and test set:
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)
(3500, 16) (1500, 16)
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
from sklearn import metrics
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)
logit = sm.Logit( y_train, X_train )
lg = logit.fit()
print(lg.summary2())
# Let's Look at Model Performance
y_pred = lg.predict(X_train)
pred_train = list(map(round, y_pred))
y_pred1 = lg.predict(X_test)
pred_test = list(map(round, y_pred1))
print('recall on train data:',recall_score(y_train, pred_train) )
print('recall on test data:',recall_score(y_test, pred_test))
Optimization terminated successfully.
Current function value: 0.107435
Iterations 10
Results: Logit
======================================================================
Model: Logit Pseudo R-squared: 0.657
Dependent Variable: Personal_Loan AIC: 786.0448
Date: 2021-04-17 02:27 BIC: 890.7736
No. Observations: 3500 Log-Likelihood: -376.02
Df Model: 16 LL-Null: -1095.5
Df Residuals: 3483 LLR p-value: 7.1236e-297
Converged: 1.0000 Scale: 1.0000
No. Iterations: 10.0000
----------------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
----------------------------------------------------------------------
const -12.4948 5.5310 -2.2591 0.0239 -23.3353 -1.6542
ID -0.0000 0.0001 -0.6467 0.5178 -0.0002 0.0001
Age -0.0245 0.0826 -0.2969 0.7665 -0.1865 0.1374
Experience 0.0298 0.0824 0.3615 0.7177 -0.1317 0.1913
Income 0.0627 0.0039 16.2830 0.0000 0.0552 0.0703
ZIPCode -0.0000 0.0001 -0.0386 0.9692 -0.0001 0.0001
CCAvg 0.2459 0.0583 4.2174 0.0000 0.1316 0.3602
Mortgage 0.0009 0.0008 1.1880 0.2348 -0.0006 0.0024
Family_2 0.0351 0.2915 0.1203 0.9042 -0.5362 0.6064
Family_3 2.5002 0.3160 7.9122 0.0000 1.8809 3.1195
Family_4 1.6355 0.3106 5.2648 0.0000 1.0266 2.2443
Education_2 4.0136 0.3481 11.5317 0.0000 3.3315 4.6958
Education_3 4.2919 0.3487 12.3071 0.0000 3.6084 4.9754
Securities_Account_1 -1.0766 0.4050 -2.6584 0.0079 -1.8704 -0.2828
CD_Account_1 3.7143 0.4384 8.4726 0.0000 2.8550 4.5735
Online_1 -0.5822 0.2064 -2.8213 0.0048 -0.9867 -0.1778
CreditCard_1 -0.9898 0.2701 -3.6647 0.0002 -1.5192 -0.4604
======================================================================
recall on train data: 0.6918429003021148
recall on test data: 0.6308724832214765
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_series1 = pd.Series([variance_inflation_factor(X_train.values,i) for i in range(X_train.shape[1])],index=X_train.columns)
print('Series before feature selection: \n\n{}\n'.format(vif_series1))
Series before feature selection: const 3273.944020 ID 1.004123 Age 96.391733 Experience 96.303885 Income 1.895744 ZIPCode 1.007376 CCAvg 1.742701 Mortgage 1.047352 Family_2 1.402783 Family_3 1.385411 Family_4 1.426929 Education_2 1.302984 Education_3 1.345607 Securities_Account_1 1.148015 CD_Account_1 1.364334 Online_1 1.041426 CreditCard_1 1.112391 dtype: float64
#ZIPCode has highest p values among those with p-value greater than 0.05
X_train1 = X_train.drop('ZIPCode', axis =1)
X_test1 = X_test.drop('ZIPCode',axis =1)
logit1 = sm.Logit(y_train, X_train1 )
lg1 = logit1.fit()
print(lg1.summary2())
# Let's Look at Model Performance
y_pred = lg1.predict(X_train1)
pred_train = list(map(round, y_pred))
y_pred1 = lg1.predict(X_test1)
pred_test = list(map(round, y_pred1))
print('Recall on train data:',recall_score(y_train, pred_train) )
print('Recall on test data:',recall_score(y_test, pred_test))
Optimization terminated successfully.
Current function value: 0.107435
Iterations 10
Results: Logit
======================================================================
Model: Logit Pseudo R-squared: 0.657
Dependent Variable: Personal_Loan AIC: 784.0463
Date: 2021-04-17 02:27 BIC: 882.6146
No. Observations: 3500 Log-Likelihood: -376.02
Df Model: 15 LL-Null: -1095.5
Df Residuals: 3484 LLR p-value: 7.1535e-298
Converged: 1.0000 Scale: 1.0000
No. Iterations: 10.0000
----------------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
----------------------------------------------------------------------
const -12.6899 2.2588 -5.6180 0.0000 -17.1170 -8.2628
ID -0.0000 0.0001 -0.6473 0.5174 -0.0002 0.0001
Age -0.0247 0.0826 -0.2988 0.7651 -0.1865 0.1372
Experience 0.0299 0.0823 0.3635 0.7162 -0.1315 0.1913
Income 0.0627 0.0039 16.2846 0.0000 0.0552 0.0703
CCAvg 0.2458 0.0582 4.2237 0.0000 0.1317 0.3598
Mortgage 0.0009 0.0008 1.1879 0.2349 -0.0006 0.0024
Family_2 0.0349 0.2915 0.1197 0.9047 -0.5364 0.6062
Family_3 2.5002 0.3160 7.9123 0.0000 1.8809 3.1196
Family_4 1.6351 0.3105 5.2664 0.0000 1.0266 2.2436
Education_2 4.0131 0.3477 11.5406 0.0000 3.3315 4.6946
Education_3 4.2917 0.3487 12.3082 0.0000 3.6082 4.9751
Securities_Account_1 -1.0766 0.4050 -2.6582 0.0079 -1.8704 -0.2828
CD_Account_1 3.7131 0.4373 8.4914 0.0000 2.8560 4.5701
Online_1 -0.5819 0.2062 -2.8219 0.0048 -0.9861 -0.1777
CreditCard_1 -0.9898 0.2701 -3.6647 0.0002 -1.5192 -0.4605
======================================================================
Recall on train data: 0.6918429003021148
Recall on test data: 0.6308724832214765
Not much change in recall, let's drop Family
#Family has highest p values among those with p-value greater than 0.05
X_train2 = X_train1.drop(['Family_2','Family_3','Family_4'], axis =1)
X_test2 = X_test1.drop(['Family_2','Family_3','Family_4'], axis =1)
logit2 = sm.Logit(y_train, X_train2 )
lg2 = logit2.fit()
print(lg2.summary2())
# Let's Look at Model Performance
y_pred = lg2.predict(X_train2)
pred_train = list(map(round, y_pred))
y_pred1 = lg2.predict(X_test2)
pred_test = list(map(round, y_pred1))
print('recall on train data:',recall_score(y_train, pred_train) )
print('recall on test data:',recall_score(y_test, pred_test))
Optimization terminated successfully.
Current function value: 0.121316
Iterations 9
Results: Logit
======================================================================
Model: Logit Pseudo R-squared: 0.612
Dependent Variable: Personal_Loan AIC: 875.2143
Date: 2021-04-17 02:27 BIC: 955.3010
No. Observations: 3500 Log-Likelihood: -424.61
Df Model: 12 LL-Null: -1095.5
Df Residuals: 3487 LLR p-value: 5.1195e-280
Converged: 1.0000 Scale: 1.0000
No. Iterations: 9.0000
----------------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
----------------------------------------------------------------------
const -10.6699 2.0908 -5.1032 0.0000 -14.7679 -6.5720
ID -0.0000 0.0001 -0.4461 0.6555 -0.0002 0.0001
Age -0.0201 0.0784 -0.2568 0.7973 -0.1737 0.1334
Experience 0.0232 0.0783 0.2960 0.7672 -0.1303 0.1766
Income 0.0543 0.0033 16.6094 0.0000 0.0479 0.0607
CCAvg 0.2047 0.0515 3.9749 0.0001 0.1038 0.3057
Mortgage 0.0010 0.0007 1.5074 0.1317 -0.0003 0.0024
Education_2 4.1679 0.3173 13.1376 0.0000 3.5461 4.7897
Education_3 4.2037 0.3151 13.3393 0.0000 3.5860 4.8214
Securities_Account_1 -1.2100 0.3927 -3.0817 0.0021 -1.9796 -0.4405
CD_Account_1 3.7822 0.4076 9.2781 0.0000 2.9832 4.5811
Online_1 -0.6304 0.1953 -3.2276 0.0012 -1.0132 -0.2476
CreditCard_1 -1.0512 0.2540 -4.1385 0.0000 -1.5490 -0.5533
======================================================================
recall on train data: 0.676737160120846
recall on test data: 0.610738255033557
#Age has highest p values among those with p-value greater than 0.05
X_train3 = X_train2.drop('Age', axis =1)
X_test3 = X_test2.drop('Age',axis =1)
logit3 = sm.Logit(y_train, X_train3 )
lg3 = logit3.fit()
print(lg3.summary2())
# Let's Look at Model Performance
y_pred = lg3.predict(X_train3)
pred_train = list(map(round, y_pred))
y_pred1 = lg3.predict(X_test3)
pred_test = list(map(round, y_pred1))
print('recall on train data:',recall_score(y_train, pred_train) )
print('recall on test data:',recall_score(y_test, pred_test))
Optimization terminated successfully.
Current function value: 0.121326
Iterations 9
Results: Logit
=======================================================================
Model: Logit Pseudo R-squared: 0.612
Dependent Variable: Personal_Loan AIC: 873.2807
Date: 2021-04-17 02:27 BIC: 947.2069
No. Observations: 3500 Log-Likelihood: -424.64
Df Model: 11 LL-Null: -1095.5
Df Residuals: 3488 LLR p-value: 4.6798e-281
Converged: 1.0000 Scale: 1.0000
No. Iterations: 9.0000
-----------------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-----------------------------------------------------------------------
const -11.1844 0.6113 -18.2973 0.0000 -12.3825 -9.9864
ID -0.0000 0.0001 -0.4454 0.6560 -0.0002 0.0001
Experience 0.0032 0.0081 0.3929 0.6944 -0.0127 0.0190
Income 0.0543 0.0033 16.6695 0.0000 0.0479 0.0607
CCAvg 0.2047 0.0515 3.9764 0.0001 0.1038 0.3056
Mortgage 0.0011 0.0007 1.5122 0.1305 -0.0003 0.0024
Education_2 4.1660 0.3173 13.1280 0.0000 3.5440 4.7879
Education_3 4.1926 0.3121 13.4339 0.0000 3.5809 4.8043
Securities_Account_1 -1.2075 0.3926 -3.0758 0.0021 -1.9769 -0.4381
CD_Account_1 3.7839 0.4075 9.2856 0.0000 2.9852 4.5826
Online_1 -0.6289 0.1952 -3.2220 0.0013 -1.0115 -0.2464
CreditCard_1 -1.0502 0.2539 -4.1367 0.0000 -1.5477 -0.5526
=======================================================================
recall on train data: 0.6737160120845922
recall on test data: 0.610738255033557
recall on train is same So lg3 is the final model that we will use for predictions and inferences Let's use model 'lg3' for making interpretations Education and CD_account are importanat variables
Check Model Performance
# Let's Look at Model Performance
y_pred = lg3.predict(X_train3)
pred_train = list(map(round, y_pred))
y_pred1 = lg3.predict(X_test3)
pred_test = list(map(round, y_pred1))
#Model performance with 0.5 threshold
print('Accuracy on train data:',accuracy_score(y_train, pred_train) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test))
print('Recall on train data:',recall_score(y_train, pred_train) )
print('Recall on test data:',recall_score(y_test, pred_test))
print('Precision on train data:',precision_score(y_train, pred_train) )
print('Precision on test data:',precision_score(y_test, pred_test))
print('f1 score on train data:',f1_score(y_train, pred_train))
print('f1 score on test data:',f1_score(y_test, pred_test))
Accuracy on train data: 0.9611428571428572 Accuracy on test data: 0.954 Recall on train data: 0.6737160120845922 Recall on test data: 0.610738255033557 Precision on train data: 0.8884462151394422 Precision on test data: 0.8921568627450981 f1 score on train data: 0.7663230240549828 f1 score on test data: 0.7250996015936254
#Lets check the AUC ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, lg3.predict(X_test3))
fpr, tpr, thresholds = roc_curve(y_test, lg3.predict(X_test3))
plt.figure(figsize=(13,8))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
Try to improve Recall using AUC-ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_test, lg3.predict(X_test3))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print(optimal_threshold)
0.12330383051949514
# Model prediction with optimal threshold
pred_train_opt = (lg.predict(X_train)>optimal_threshold).astype(int)
pred_test_opt = (lg.predict(X_test)>optimal_threshold).astype(int)
from sklearn.metrics import precision_recall_curve
y_scores=lg3.predict(X_train3)
prec, rec, tre = precision_recall_curve(y_train, y_scores, )
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], 'b--', label='precision')
plt.plot(thresholds, recalls[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
Decreasing threshold beyond 0.2 will lead to fast decrease in Precision, which will lead to great loss of opportunity, so let's consider threshold of 0.2
optimal_threshold = 0.2
# Model prediction with optimal threshold
pred_train_opt = (lg.predict(X_train)>optimal_threshold).astype(int)
pred_test_opt = (lg.predict(X_test)>optimal_threshold).astype(int)
#Model performance with optimal threhold
print('Accuracy on train data:',accuracy_score(y_train, pred_train_opt) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test_opt))
print('Recall on train data:',recall_score(y_train, pred_train_opt))
print('Recall on test data:',recall_score(y_test, pred_test_opt))
print('Precision on train data:',precision_score(y_train, pred_train_opt) )
print('Precision on test data:',precision_score(y_test, pred_test_opt))
print('f1 score on train data:',f1_score(y_train, pred_train_opt))
print('f1 score on test data:',f1_score(y_test, pred_test_opt))
Accuracy on train data: 0.9485714285714286 Accuracy on test data: 0.9453333333333334 Recall on train data: 0.8429003021148036 Recall on test data: 0.7919463087248322 Precision on train data: 0.6855036855036855 Precision on test data: 0.6982248520710059 f1 score on train data: 0.7560975609756097 f1 score on test data: 0.7421383647798743
from sklearn.metrics import classification_report,confusion_matrix
def make_confusion_matrix(y_actual,y_predict,labels=[1, 0]):
'''
y_predict: prediction of class
y_actual : ground truth
'''
cm=confusion_matrix( y_predict,y_actual, labels=[1, 0])
df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
columns = [i for i in ['1','0']])
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in
zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=labels,fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
# let us make confusion matrix on train set
make_confusion_matrix(y_train,pred_train_opt)
make_confusion_matrix(y_test,pred_test_opt)
If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and the decision tree will become biased toward the dominant classes.
In this case, we can pass a dictionary {0:0.15,1:0.85} to the model to specify the weight of each class and the decision tree will give more weightage to class 1.
class_weight is a hyperparameter for the decision tree classifier.
model = DecisionTreeClassifier(criterion='gini',class_weight={0:0.15,1:0.85},random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight={0: 0.15, 1: 0.85},
criterion='gini', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort='deprecated', random_state=1, splitter='best')
def make_confusion_matrix(model,y_actual,labels=[1, 0]):
'''
model : classifier to predict values of X
y_actual : ground truth
'''
y_predict = model.predict(X_test)
cm=metrics.confusion_matrix( y_actual, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
columns = [i for i in ['Predicted - No','Predicted - Yes']])
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in
zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=labels,fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
make_confusion_matrix(model,y_test)
y_train.value_counts(1)
0 0.905429 1 0.094571 Name: Personal_Loan, dtype: float64
We only have 9% of positive classes, so if our model marks each sample as negative, then also we'll get 90% accuracy, hence accuracy is not a good metric to evaluate here.
True Positives:
True Negatives:
False Positives:
False Negatives:
The data imbalance is present in the sample or population itself, as only a small percentage of customers actually sign up for the targetted schemes such as a personal loan. So the model itself is not doing a poor job of misclassifying between the predicted and actual. The false positives and false negatives (Class 1 and Class 2 errors) are actually fairly low(~1%) in teh confusion matrix above
## Function to calculate recall score
def get_recall_score(model):
'''
model : classifier to predict values of X
'''
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
get_recall_score(model)
Recall on training set : 1.0 Recall on test set : 0.8859060402684564
feature_names = list(dummy_data.columns)
plt.figure(figsize=(20,30))
out = tree.plot_tree(model,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None)
#below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor('black')
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model,feature_names=feature_names,show_weights=True))
|--- ZIPCode <= 98.50 | |--- Mortgage <= 2.95 | | |--- weights: [374.10, 0.00] class: 0 | |--- Mortgage > 2.95 | | |--- CD_Account_1 <= 0.50 | | | |--- Mortgage <= 3.95 | | | | |--- ZIPCode <= 81.50 | | | | | |--- Experience <= 36.50 | | | | | | |--- Family_4 <= 0.50 | | | | | | | |--- Mortgage <= 3.50 | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | |--- Mortgage > 3.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Family_4 > 0.50 | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | |--- Experience > 36.50 | | | | | | |--- CCAvg <= 91269.00 | | | | | | | |--- Age <= 1184.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- Age > 1184.50 | | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | | | |--- CCAvg > 91269.00 | | | | | | | |--- weights: [5.55, 0.00] class: 0 | | | | |--- ZIPCode > 81.50 | | | | | |--- Age <= 934.50 | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | |--- Age > 934.50 | | | | | | |--- CCAvg <= 95084.00 | | | | | | | |--- Mortgage <= 3.05 | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | | |--- Mortgage > 3.05 | | | | | | | | |--- Income <= 38.50 | | | | | | | | | |--- Family_4 <= 0.50 | | | | | | | | | | |--- CCAvg <= 90692.00 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- CCAvg > 90692.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- Family_4 > 0.50 | | | | | | | | | | |--- Education_3 <= 0.50 | | | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | | | | | |--- Education_3 > 0.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- Income > 38.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | |--- CCAvg > 95084.00 | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | |--- Mortgage > 3.95 | | | | |--- weights: [6.75, 0.00] class: 0 | | |--- CD_Account_1 > 0.50 | | | |--- Age <= 766.50 | | | | |--- weights: [0.15, 0.00] class: 0 | | | |--- Age > 766.50 | | | | |--- weights: [0.00, 6.80] class: 1 |--- ZIPCode > 98.50 | |--- Education_3 <= 0.50 | | |--- Education_2 <= 0.50 | | | |--- Family_3 <= 0.50 | | | | |--- Family_4 <= 0.50 | | | | | |--- ZIPCode <= 100.00 | | | | | | |--- CCAvg <= 91169.00 | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | |--- CCAvg > 91169.00 | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | |--- ZIPCode > 100.00 | | | | | | |--- ZIPCode <= 103.50 | | | | | | | |--- Securities_Account_1 <= 0.50 | | | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | | | |--- Securities_Account_1 > 0.50 | | | | | | | | |--- CreditCard_1 <= 0.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- CreditCard_1 > 0.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- ZIPCode > 103.50 | | | | | | | |--- weights: [64.95, 0.00] class: 0 | | | | |--- Family_4 > 0.50 | | | | | |--- ZIPCode <= 102.00 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- ZIPCode > 102.00 | | | | | | |--- weights: [0.00, 16.15] class: 1 | | | |--- Family_3 > 0.50 | | | | |--- ZIPCode <= 108.50 | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | |--- ZIPCode > 108.50 | | | | | |--- Experience <= 26.00 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Experience > 26.00 | | | | | | |--- CCAvg <= 90019.50 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- CCAvg > 90019.50 | | | | | | | |--- ZIPCode <= 118.00 | | | | | | | | |--- ZIPCode <= 112.00 | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | |--- ZIPCode > 112.00 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- ZIPCode > 118.00 | | | | | | | | |--- weights: [0.00, 28.05] class: 1 | | |--- Education_2 > 0.50 | | | |--- ZIPCode <= 110.50 | | | | |--- Mortgage <= 3.54 | | | | | |--- ZIPCode <= 106.50 | | | | | | |--- weights: [3.90, 0.00] class: 0 | | | | | |--- ZIPCode > 106.50 | | | | | | |--- Income <= 27.00 | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | |--- Income > 27.00 | | | | | | | |--- Mortgage <= 1.85 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- Mortgage > 1.85 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | |--- Mortgage > 3.54 | | | | | |--- weights: [0.00, 2.55] class: 1 | | | |--- ZIPCode > 110.50 | | | | |--- ZIPCode <= 116.50 | | | | | |--- Personal_Loan <= 141.50 | | | | | | |--- Income <= 35.50 | | | | | | | |--- Mortgage <= 1.20 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- Mortgage > 1.20 | | | | | | | | |--- CCAvg <= 94887.00 | | | | | | | | | |--- Mortgage <= 2.65 | | | | | | | | | | |--- ZIPCode <= 113.50 | | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | | | |--- ZIPCode > 113.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | |--- Mortgage > 2.65 | | | | | | | | | | |--- weights: [0.00, 4.25] class: 1 | | | | | | | | |--- CCAvg > 94887.00 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Income > 35.50 | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | |--- Personal_Loan > 141.50 | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | |--- ZIPCode > 116.50 | | | | | |--- weights: [0.00, 91.80] class: 1 | |--- Education_3 > 0.50 | | |--- ZIPCode <= 116.50 | | | |--- Mortgage <= 2.45 | | | | |--- Experience <= 41.50 | | | | | |--- weights: [3.60, 0.00] class: 0 | | | | |--- Experience > 41.50 | | | | | |--- Income <= 31.50 | | | | | | |--- Mortgage <= 0.35 | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | |--- Mortgage > 0.35 | | | | | | | |--- CCAvg <= 93596.00 | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- CCAvg > 93596.00 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Income > 31.50 | | | | | | |--- weights: [1.50, 0.00] class: 0 | | | |--- Mortgage > 2.45 | | | | |--- Age <= 4852.50 | | | | | |--- CCAvg <= 90389.50 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- CCAvg > 90389.50 | | | | | | |--- CD_Account_1 <= 0.50 | | | | | | | |--- ZIPCode <= 99.50 | | | | | | | | |--- CCAvg <= 93882.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- CCAvg > 93882.50 | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | |--- ZIPCode > 99.50 | | | | | | | | |--- weights: [0.00, 11.05] class: 1 | | | | | | |--- CD_Account_1 > 0.50 | | | | | | | |--- Experience <= 46.50 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- Experience > 46.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | |--- Age > 4852.50 | | | | | |--- weights: [0.15, 0.00] class: 0 | | |--- ZIPCode > 116.50 | | | |--- weights: [0.00, 96.90] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp Income 0.593288 Education_2 0.088134 CCAvg 0.084049 Family_4 0.071207 Family_3 0.070324 Education_3 0.037138 ZIPCode 0.012504 CD_Account_1 0.011338 Age 0.009041 Experience 0.008349 ID 0.008189 Mortgage 0.002947 Securities_Account_1 0.002769 CreditCard_1 0.000721 Online_1 0.000000 const 0.000000 Family_2 0.000000
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
According to the decision tree model, Zipcode is the most important variable for predicting the potential customer for loan. This could be due to ceratin zipcodes have houses where families with high income and hish education live and have high credit card usage.
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1,class_weight = {0:.15,1:.85})
# Grid of parameters to choose from
parameters = {
'max_depth': np.arange(1,10),
'criterion': ['entropy','gini'],
'splitter': ['best','random'],
'min_impurity_decrease': [0.000001,0.00001,0.0001],
'max_features': ['log2','sqrt']
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight={0: 0.15, 1: 0.85},
criterion='gini', max_depth=9, max_features='log2',
max_leaf_nodes=None, min_impurity_decrease=0.0001,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort='deprecated', random_state=1, splitter='best')
make_confusion_matrix(estimator,y_test)
get_recall_score(estimator)
Recall on training set : 0.9335347432024169 Recall on test set : 0.6912751677852349
Recall has reduced model overfitting after hyperparameter tuning and we have a generalized model.
plt.figure(figsize=(15,10))
out = tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor('black')
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
#feature_names.remove('Experience')
print(tree.export_text(estimator,feature_names=feature_names,show_weights=True))
|--- Education_3 <= 0.50 | |--- Family_3 <= 0.50 | | |--- ZIPCode <= 98.50 | | | |--- Age <= 974.50 | | | | |--- weights: [40.05, 0.00] class: 0 | | | |--- Age > 974.50 | | | | |--- Mortgage <= 2.95 | | | | | |--- weights: [150.75, 0.00] class: 0 | | | | |--- Mortgage > 2.95 | | | | | |--- Mortgage <= 4.20 | | | | | | |--- ZIPCode <= 74.50 | | | | | | | |--- weights: [2.25, 0.00] class: 0 | | | | | | |--- ZIPCode > 74.50 | | | | | | | |--- CCAvg <= 95084.00 | | | | | | | | |--- ZIPCode <= 82.50 | | | | | | | | | |--- weights: [0.90, 1.70] class: 1 | | | | | | | | |--- ZIPCode > 82.50 | | | | | | | | | |--- weights: [0.90, 7.65] class: 1 | | | | | | | |--- CCAvg > 95084.00 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | |--- Mortgage > 4.20 | | | | | | |--- weights: [2.55, 0.00] class: 0 | | |--- ZIPCode > 98.50 | | | |--- CD_Account_1 <= 0.50 | | | | |--- Mortgage <= 1.05 | | | | | |--- Family_4 <= 0.50 | | | | | | |--- CCAvg <= 90270.50 | | | | | | | |--- CCAvg <= 90042.50 | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | |--- CCAvg > 90042.50 | | | | | | | | |--- Education_2 <= 0.50 | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | |--- Education_2 > 0.50 | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | |--- CCAvg > 90270.50 | | | | | | | |--- Income <= 11.50 | | | | | | | | |--- Personal_Loan <= 401.00 | | | | | | | | | |--- weights: [3.30, 0.00] class: 0 | | | | | | | | |--- Personal_Loan > 401.00 | | | | | | | | | |--- weights: [0.15, 0.85] class: 1 | | | | | | | |--- Income > 11.50 | | | | | | | | |--- weights: [7.80, 0.00] class: 0 | | | | | |--- Family_4 > 0.50 | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | |--- Mortgage > 1.05 | | | | | |--- Family_4 <= 0.50 | | | | | | |--- CCAvg <= 94708.50 | | | | | | | |--- Mortgage <= 8.90 | | | | | | | | |--- Age <= 474.50 | | | | | | | | | |--- weights: [3.75, 4.25] class: 1 | | | | | | | | |--- Age > 474.50 | | | | | | | | | |--- weights: [40.50, 15.30] class: 0 | | | | | | | |--- Mortgage > 8.90 | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | |--- CCAvg > 94708.50 | | | | | | | |--- ZIPCode <= 113.00 | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | |--- ZIPCode > 113.00 | | | | | | | | |--- Age <= 3015.50 | | | | | | | | | |--- weights: [4.80, 10.20] class: 1 | | | | | | | | |--- Age > 3015.50 | | | | | | | | | |--- weights: [3.75, 0.85] class: 0 | | | | | |--- Family_4 > 0.50 | | | | | | |--- Income <= 34.00 | | | | | | | |--- Mortgage <= 2.55 | | | | | | | | |--- Mortgage <= 2.15 | | | | | | | | | |--- weights: [0.45, 5.10] class: 1 | | | | | | | | |--- Mortgage > 2.15 | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | | |--- Mortgage > 2.55 | | | | | | | | |--- weights: [0.15, 15.30] class: 1 | | | | | | |--- Income > 34.00 | | | | | | | |--- Experience <= 62.50 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | |--- Experience > 62.50 | | | | | | | | |--- Age <= 2250.00 | | | | | | | | | |--- weights: [0.30, 2.55] class: 1 | | | | | | | | |--- Age > 2250.00 | | | | | | | | | |--- weights: [0.45, 0.85] class: 1 | | | |--- CD_Account_1 > 0.50 | | | | |--- CCAvg <= 95098.00 | | | | | |--- CreditCard_1 <= 0.50 | | | | | | |--- CCAvg <= 91545.50 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- CCAvg > 91545.50 | | | | | | | |--- weights: [0.15, 8.50] class: 1 | | | | | |--- CreditCard_1 > 0.50 | | | | | | |--- Income <= 11.50 | | | | | | | |--- Education_2 <= 0.50 | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | |--- Education_2 > 0.50 | | | | | | | | |--- weights: [0.15, 3.40] class: 1 | | | | | | |--- Income > 11.50 | | | | | | | |--- Education_2 <= 0.50 | | | | | | | | |--- Family_2 <= 0.50 | | | | | | | | | |--- weights: [0.90, 6.80] class: 1 | | | | | | | | |--- Family_2 > 0.50 | | | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | | | |--- Education_2 > 0.50 | | | | | | | | |--- weights: [0.00, 13.60] class: 1 | | | | |--- CCAvg > 95098.00 | | | | | |--- weights: [0.30, 0.00] class: 0 | |--- Family_3 > 0.50 | | |--- Age <= 180.50 | | | |--- weights: [3.15, 0.00] class: 0 | | |--- Age > 180.50 | | | |--- CD_Account_1 <= 0.50 | | | | |--- ZIPCode <= 113.50 | | | | | |--- Experience <= 36.50 | | | | | | |--- ZIPCode <= 106.50 | | | | | | | |--- Income <= 11.50 | | | | | | | | |--- CCAvg <= 91362.50 | | | | | | | | | |--- weights: [2.40, 0.85] class: 0 | | | | | | | | |--- CCAvg > 91362.50 | | | | | | | | | |--- weights: [11.10, 0.00] class: 0 | | | | | | | |--- Income > 11.50 | | | | | | | | |--- CCAvg <= 94204.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- CCAvg > 94204.50 | | | | | | | | | |--- weights: [0.30, 0.85] class: 1 | | | | | | |--- ZIPCode > 106.50 | | | | | | | |--- Income <= 5.00 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- Income > 5.00 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | |--- Experience > 36.50 | | | | | | |--- Education_2 <= 0.50 | | | | | | | |--- weights: [19.95, 0.00] class: 0 | | | | | | |--- Education_2 > 0.50 | | | | | | | |--- Mortgage <= 3.35 | | | | | | | | |--- weights: [26.70, 0.00] class: 0 | | | | | | | |--- Mortgage > 3.35 | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | |--- ZIPCode > 113.50 | | | | | |--- weights: [0.15, 38.25] class: 1 | | | |--- CD_Account_1 > 0.50 | | | | |--- Income <= 26.00 | | | | | |--- Education_2 <= 0.50 | | | | | | |--- ZIPCode <= 68.50 | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | |--- ZIPCode > 68.50 | | | | | | | |--- CCAvg <= 92297.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- CCAvg > 92297.50 | | | | | | | | |--- weights: [0.15, 5.10] class: 1 | | | | | |--- Education_2 > 0.50 | | | | | | |--- weights: [0.15, 9.35] class: 1 | | | | |--- Income > 26.00 | | | | | |--- Income <= 31.50 | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | |--- Income > 31.50 | | | | | | |--- ZIPCode <= 92.50 | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | |--- ZIPCode > 92.50 | | | | | | | |--- weights: [0.00, 5.10] class: 1 |--- Education_3 > 0.50 | |--- Experience <= 38.50 | | |--- Personal_Loan <= 317.00 | | | |--- ZIPCode <= 90.00 | | | | |--- weights: [32.25, 0.00] class: 0 | | | |--- ZIPCode > 90.00 | | | | |--- Personal_Loan <= 285.50 | | | | | |--- ZIPCode <= 116.50 | | | | | | |--- Personal_Loan <= 169.50 | | | | | | | |--- Mortgage <= 2.40 | | | | | | | | |--- weights: [2.40, 0.00] class: 0 | | | | | | | |--- Mortgage > 2.40 | | | | | | | | |--- Experience <= 35.50 | | | | | | | | | |--- weights: [0.15, 5.95] class: 1 | | | | | | | | |--- Experience > 35.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Personal_Loan > 169.50 | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | |--- ZIPCode > 116.50 | | | | | | |--- weights: [0.00, 34.85] class: 1 | | | | |--- Personal_Loan > 285.50 | | | | | |--- weights: [0.45, 0.00] class: 0 | | |--- Personal_Loan > 317.00 | | | |--- ZIPCode <= 117.50 | | | | |--- weights: [0.30, 0.00] class: 0 | | | |--- ZIPCode > 117.50 | | | | |--- weights: [0.00, 6.80] class: 1 | |--- Experience > 38.50 | | |--- Income <= 18.50 | | | |--- ZIPCode <= 109.50 | | | | |--- Experience <= 39.50 | | | | | |--- ZIPCode <= 91.00 | | | | | | |--- weights: [4.05, 0.00] class: 0 | | | | | |--- ZIPCode > 91.00 | | | | | | |--- Family_4 <= 0.50 | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | |--- Family_4 > 0.50 | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | |--- Experience > 39.50 | | | | | |--- weights: [17.85, 0.00] class: 0 | | | |--- ZIPCode > 109.50 | | | | |--- Mortgage <= 1.05 | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | |--- Mortgage > 1.05 | | | | | |--- weights: [0.00, 6.80] class: 1 | | |--- Income > 18.50 | | | |--- ZIPCode <= 98.50 | | | | |--- weights: [76.35, 1.70] class: 0 | | | |--- ZIPCode > 98.50 | | | | |--- Experience <= 65.50 | | | | | |--- Income <= 38.50 | | | | | | |--- ZIPCode <= 113.50 | | | | | | | |--- ZIPCode <= 111.50 | | | | | | | | |--- Mortgage <= 1.75 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- Mortgage > 1.75 | | | | | | | | | |--- weights: [0.15, 6.80] class: 1 | | | | | | | |--- ZIPCode > 111.50 | | | | | | | | |--- Experience <= 54.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- Experience > 54.50 | | | | | | | | | |--- weights: [0.30, 0.85] class: 1 | | | | | | |--- ZIPCode > 113.50 | | | | | | | |--- weights: [0.00, 49.30] class: 1 | | | | | |--- Income > 38.50 | | | | | | |--- Mortgage <= 3.15 | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | |--- Mortgage > 3.15 | | | | | | | |--- weights: [0.00, 3.40] class: 1 | | | | |--- Experience > 65.50 | | | | | |--- weights: [0.45, 0.00] class: 0
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
#Here we will see that importance of features has increased
Imp Income 0.666246 CCAvg 0.082284 CD_Account_1 0.061623 Family_4 0.046606 Family_3 0.027189 Education_3 0.019513 Age 0.017450 ZIPCode 0.017048 Mortgage 0.016053 Experience 0.015991 ID 0.014611 Education_2 0.009812 Family_2 0.005310 CreditCard_1 0.000263 Securities_Account_1 0.000000 Online_1 0.000000 const 0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
The DecisionTreeClassifier provides parameters such as
min_samples_leaf and max_depth to prevent a tree from overfiting. Cost
complexity pruning provides another option to control the size of a tree. In
DecisionTreeClassifier, this pruning technique is parameterized by the
cost complexity parameter, ccp_alpha. Greater values of ccp_alpha
increase the number of nodes pruned. Here we only show the effect of
ccp_alpha on regularizing the trees and how to choose a ccp_alpha
based on validation scores.
Minimal cost complexity pruning recursively finds the node with the "weakest
link". The weakest link is characterized by an effective alpha, where the
nodes with the smallest effective alpha are pruned first. To get an idea of
what values of ccp_alpha could be appropriate, scikit-learn provides
DecisionTreeClassifier.cost_complexity_pruning_path that returns the
effective alphas and the corresponding total leaf impurities at each step of
the pruning process. As alpha increases, more of the tree is pruned, which
increases the total impurity of its leaves.
clf = DecisionTreeClassifier(random_state=1,class_weight = {0:0.15,1:0.85})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000 | -5.887153e-15 |
| 1 | 0.000197 | 3.944694e-04 |
| 2 | 0.000337 | 7.314589e-04 |
| 3 | 0.000364 | 1.095772e-03 |
| 4 | 0.000369 | 2.201519e-03 |
| 5 | 0.000370 | 3.682792e-03 |
| 6 | 0.000372 | 5.913895e-03 |
| 7 | 0.000374 | 6.288328e-03 |
| 8 | 0.000388 | 6.676230e-03 |
| 9 | 0.000389 | 7.453413e-03 |
| 10 | 0.000393 | 7.846223e-03 |
| 11 | 0.000586 | 8.432291e-03 |
| 12 | 0.000645 | 1.036750e-02 |
| 13 | 0.000655 | 1.102215e-02 |
| 14 | 0.000655 | 1.167762e-02 |
| 15 | 0.000676 | 1.235343e-02 |
| 16 | 0.000879 | 1.323240e-02 |
| 17 | 0.000909 | 1.414174e-02 |
| 18 | 0.000940 | 1.508217e-02 |
| 19 | 0.000941 | 1.696372e-02 |
| 20 | 0.000995 | 1.895399e-02 |
| 21 | 0.001011 | 1.996515e-02 |
| 22 | 0.001013 | 2.097832e-02 |
| 23 | 0.001019 | 2.199727e-02 |
| 24 | 0.001116 | 2.311322e-02 |
| 25 | 0.001471 | 2.458383e-02 |
| 26 | 0.001638 | 2.622188e-02 |
| 27 | 0.001686 | 2.959469e-02 |
| 28 | 0.001844 | 3.143833e-02 |
| 29 | 0.002603 | 3.404096e-02 |
| 30 | 0.002742 | 3.678339e-02 |
| 31 | 0.003336 | 4.011939e-02 |
| 32 | 0.003410 | 4.352930e-02 |
| 33 | 0.003527 | 4.705652e-02 |
| 34 | 0.004797 | 5.665076e-02 |
| 35 | 0.005138 | 6.178904e-02 |
| 36 | 0.006726 | 6.851486e-02 |
| 37 | 0.022532 | 9.104708e-02 |
| 38 | 0.030573 | 2.133399e-01 |
| 39 | 0.253796 | 4.671356e-01 |
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha,class_weight = {0:0.15,1:0.85})
clf.fit(X_train, y_train)
clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.25379571489481034
For the remainder, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and tree depth decreases as alpha
increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train=[]
for clf in clfs:
pred_train3=clf.predict(X_train)
values_train=metrics.recall_score(y_train,pred_train3)
recall_train.append(values_train)
recall_test=[]
for clf in clfs:
pred_test3=clf.predict(X_test)
values_test=metrics.recall_score(y_test,pred_test3)
recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post",)
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
Maximum value of Recall is at 0.014 alpha, but if we choose decision tree will only have a root node and we would lose the buisness rules, instead we can choose alpha 0.006 retaining information and getting higher recall.
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.006725813690406864,
class_weight={0: 0.15, 1: 0.85}, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=1, splitter='best')
best_model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.006725813690406864,
class_weight={0: 0.15, 1: 0.85}, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=1, splitter='best')
make_confusion_matrix(best_model,y_test)
get_recall_score(best_model)
Recall on training set : 0.9909365558912386 Recall on test set : 0.9865771812080537
plt.figure(figsize=(10,10))
out = tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor('black')
arrow.set_linewidth(1)
plt.show()
comparison_frame = pd.DataFrame({'Model':['Logistic Regression','Initial decision tree model','Decision treee with hyperparameter tuning',
'Decision tree with post-pruning'], 'Train_Recall':[0.84,1,0.933,0.99], 'Test_Recall':[0.79,0.886,0.69,0.98]})
comparison_frame
| Model | Train_Recall | Test_Recall | |
|---|---|---|---|
| 0 | Logistic Regression | 0.850 | 0.780 |
| 1 | Initial decision tree model | 1.000 | 0.886 |
| 2 | Decision treee with hyperparameter tuning | 0.933 | 0.690 |
| 3 | Decision tree with post-pruning | 0.990 | 0.980 |
Decision tree model with post pruning has given the best recall score on data.
According to the decision tree model, Zipcode is the most important variable for predicting the potential customer for loan. This could be due to certain zipcodes have houses where families with high income and hish education live and have high credit card usage.
Website could be used for personalized loan offers could be provided from within the dashboard itself. Provides easy tracking and immediate response from eth customer.
When someone has a CD_Account, the probability of them having take a personal loan is high. So good base for marketing/sales effort
When someone has a high credir card balance, the probability of them having take a personal loan is high. So good base for marketing/sales effort
Family with size of 4 or more have higher probability of taking a personal loan
Education level of 2 and up have higher probability of taking a personal loan